[WIP] feat(lora): LoRA adapter serving by qywu · Pull Request #83 · lightseekorg/tokenspeed

qywu · 2026-05-11T23:13:18Z

Summary (WIP)

End-to-end LoRA adapter serving for tokenspeed. Branch is not yet rebased on current main — many test files appear as deletions because the last merge from main predates several recent PRs (#18, #51, etc.). Will refresh before un-drafting.

What's in this PR

Scaffolding: feat(lora): scaffold LoRA adapter serving infrastructure.
Prefix-cache namespacing (C++): per-adapter namespacing in the scheduler so two adapters with the same prompt don't collide on cached KV.
HiCache wiring: thread lora_id through hybrid cache paths.
LoraManager: GPU weight pool with LRU eviction, TP-aware adapter application.
HTTP plumbing: lora_path accepted on /v1/completions and /v1/chat/completions; propagated through GenerateReqInput.__getitem__.
MLP target support: gate_proj / up_proj / down_proj LoRA targets in addition to attention QKV/output.
CUDA-graph support: segment-grouped Triton kernels; separate no-LoRA graph variant captured so base-only batches skip the LoRA path.
Tiered pool: GPU ↔ CPU ↔ disk pool with async prefetch.
Pack scheduling: pack policy + cold/warm latency benchmark.
Eager-mode fixes: --enable-lora works without CUDA graphs.
Misc perf: drop pure-PyTorch RMSNorm fallback in qk_norm; evict adapter namespace on unload.
Docs: HTML references for the LoRA implementation and the broader tokenspeed codebase structure.

Status

This is an early draft — opening for visibility and review of the overall shape. Next steps before un-drafting:

Rebase on current main (resolve stale deletions of the test files from #18, perf(eviction): O(k log N) eviction via persistent LRU set, and #51, feat(deepseek-v4): add scheduler-managed sliding-window cache groups).
Add Python-level integration tests for --enable-lora (currently only C++ unit test test_lora_prefix_cache.cpp).
Benchmark numbers: cold-load, warm hit, pack vs no-pack throughput.
Document the HTTP API surface for lora_path in the OpenAI-compat docs.

Test plan

C++ unit test: test_lora_prefix_cache.cpp.
Python E2E: load base + 2 adapters, verify per-adapter outputs, prefix-cache namespace isolation.
TP=2 sanity check once this branch is rebased to pick up the dense-MLP TP fix from #80 (fix(qwen3): plumb tensor-parallel info through MLP layers), which is already in main.

Adds the foundational types and API surface for PEFT-style LoRA adapter serving, unblocking the full runtime implementation.
New files:
- python/tokenspeed/runtime/lora/lora_config.py: LoraConfig dataclass; loads from PEFT adapter_config.json; exposes r, lora_alpha, scaling.
- python/tokenspeed/runtime/lora/lora_registry.py: LoraRegistry tracks loaded adapters, maps names to stable integer IDs, enforces max_loras capacity (pinned adapters bypass the limit).
- python/tokenspeed/runtime/lora/__init__.py
API additions:
- GenerateReqInput.lora_path: per-request adapter selector (name or path).
- ServerArgs: --enable-lora, --max-loras, --max-lora-rank.
- EngineBase.load_lora_adapter() / unload_lora_adapter(): abstract API with NotImplementedError stubs; full implementation tracked in PR #2.
Tests:
- test/runtime/lora/test_lora_registry.py: 11 unit tests covering registration, capacity enforcement, pinning, unregister, scaling.
TODO (tracked in PR):
- LoraManager: weight loading from safetensors into pre-allocated GPU buffers (one buffer per target module × max_lora_rank).
- Request routing: resolve lora_path → lora_id in scheduler.
- Batched LoRA matmuls (sgmv / punica kernels or torch fallback).
- Engine.load/unload implementations calling LoraManager.
- OpenAI API: expose lora_path in /v1/completions and /v1/chat/completions.
- C++ scheduler: pass lora_id on requests for prefix-cache namespacing.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…heduler
Implements the correct LoRA prefix cache namespace so:
• Same adapter + same tokens → cache hit ✓
• Different adapters + same tokens → no cross-adapter hit ✓
Design: per-adapter virtual root node
For each lora_id > 0, KVPrefixCache::getOrCreateLoraRoot() creates a child of the real root keyed by a one-page sentinel token [-lora_id, 0, ..., 0]. Negative token IDs never appear in real vocabularies (non-negative), so there is no collision between adapters or with the base-model namespace.
An empty DeviceResource is attached to the virtual root so:
• OnDevice() == true → PruneEmptyByNode never removes it
• IsLeaf() == false → eviction never tries to evict it
KVPrefixCache::Match() and Insert() accept a lora_id parameter (default 0) and call resolveStartNode() to obtain the correct namespace root.
MatchResult::Device::namespace_depth_offset (new field, default 0) is set to 1 for LoRA requests and subtracted inside DepthInPage() so all callers see the number of real matched token pages, not including the sentinel page.
Changes:
• request_spec.h: add lora_id: int32_t = 0
• request.h / request.cpp: store + expose LoraId()
• kv_prefix_cache.h/cpp: getOrCreateLoraRoot, resolveStartNode, lora_id param on Match + Insert
• types.h / types.cpp: namespace_depth_offset in MatchResult
• forward_events.h/cpp: FinishEvent carries lora_id_, passes to Insert/Match
• forward.cpp: pass request->LoraId() to all Match calls
• outside_event_handler.cpp: pass req->LoraId() to FinishEvent
• python_module.cpp: expose lora_id on Python RequestSpec
Tests (test_lora_prefix_cache.cpp, 6 cases):
• SameAdapterReusesPrefixCache
• DifferentAdaptersDontShareCache
• BaseModelIndependentOfAdapters
• MultipleAdaptersCacheIndependently
• InsertLastNodeIsInAdapterNamespace
• EvictionDoesNotCrossNamespaces
All 120 C++ tests pass.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
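
The virtual-root idea can be illustrated with a toy page-level trie. This is a minimal sketch, not the C++ KVPrefixCache: `ToyPrefixCache`, `Node`, and `PAGE_SIZE` are hypothetical names, and only the sentinel layout `[-lora_id, 0, ..., 0]` and the "count real pages only" behaviour are taken from the commit message. Here depth is counted from the namespace root, so the sentinel page is excluded by construction rather than through a `namespace_depth_offset` field.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4  # hypothetical page size, chosen only for illustration


@dataclass
class Node:
    # Maps a page (tuple of token ids) to the child node for that page.
    children: dict = field(default_factory=dict)


class ToyPrefixCache:
    """Toy page-level prefix trie with one virtual root per adapter."""

    def __init__(self):
        self.root = Node()

    def _start_node(self, lora_id, create):
        # lora_id == 0 is the base model and uses the real root directly.
        if lora_id == 0:
            return self.root
        # Sentinel page [-lora_id, 0, ..., 0]: real token ids are non-negative,
        # so this key cannot collide with a real page or another adapter.
        sentinel = (-lora_id,) + (0,) * (PAGE_SIZE - 1)
        if create:
            return self.root.children.setdefault(sentinel, Node())
        return self.root.children.get(sentinel)

    @staticmethod
    def _pages(tokens):
        # Only full pages are cached in this sketch.
        full = len(tokens) - len(tokens) % PAGE_SIZE
        return [tuple(tokens[i:i + PAGE_SIZE]) for i in range(0, full, PAGE_SIZE)]

    def insert(self, tokens, lora_id=0):
        node = self._start_node(lora_id, create=True)
        for page in self._pages(tokens):
            node = node.children.setdefault(page, Node())

    def match(self, tokens, lora_id=0):
        """Return the number of real token pages matched (sentinel excluded)."""
        node = self._start_node(lora_id, create=False)
        if node is None:
            return 0
        depth = 0
        for page in self._pages(tokens):
            node = node.children.get(page)
            if node is None:
                break
            depth += 1
        return depth


cache = ToyPrefixCache()
prompt = list(range(8))  # two full pages of tokens
cache.insert(prompt, lora_id=1)
same_adapter = cache.match(prompt, lora_id=1)   # hit within the adapter namespace
other_adapter = cache.match(prompt, lora_id=2)  # no cross-adapter hit
base_model = cache.match(prompt, lora_id=0)     # base namespace is separate
```

The three lookups mirror the SameAdapterReusesPrefixCache, DifferentAdaptersDontShareCache, and BaseModelIndependentOfAdapters cases in spirit only.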

Three paths were missing lora_id, causing cross-adapter KV cache collisions when the hybrid (Mamba / HiCache) prefix cache is enabled: 1. HybridPrefixCache::Match() — added lora_id param, passes through to KVPrefixCache::Match() so the per-adapter virtual root is used for L2 host-cache matching as well as device matching. 2. InsertHybridCache() — added lora_id param, passes through to KVPrefixCache::Insert() so chunked-prefill inserts land in the correct adapter namespace (previously always defaulted to kLoraNone). 3. SchedulePrefillEvent / ScheduleDecodeEvent — added lora_id_ field; forward.cpp passes request->LoraId() at construction time. Both events call InsertHybridCache() and now supply the adapter id. Also fixes the schedulePrefillFirstChunk hybrid-path Match call which was passing lora_id only on the non-hybrid branch. All 120 C++ tests pass. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…lication
Implements the weight management layer for LoRA adapter serving.
LoraManager (python/tokenspeed/runtime/lora/lora_manager.py)
- Pre-allocates a fixed GPU buffer with max_loras+1 slots (slot 0 = base model).
- load_adapter(name, path): loads PEFT safetensors to CPU, computes scaling from adapter_config.json (lora_alpha / r).
- unload_adapter(name): zeroes the GPU slot and frees CPU cache.
- prepare_loras(lora_ids): copies active adapters into GPU slots on demand, returns weight_indices [bs] and scalings [n_slots]; evicts LRU non-pinned adapters when the pool is full.
- apply_qkv_lora / apply_o_lora: bmm-based delta application, TP-aware (column-parallel projections shard B; row-parallel o_proj shards A and all_reduces the partial output).
Model integration (qwen3.py)
- Qwen3Attention.forward injects LoRA delta after qkv_proj and o_proj when ctx.lora_manager is set.
- layer_id stored on Qwen3Attention.
Context / executor (context.py, model_executor.py)
- ForwardContext gains lora_weight_indices, lora_scalings, lora_manager.
- ModelExecutor.execute_forward_op injects LoRA info into ForwardContext when any request in the batch carries a non-zero lora_id.
End-to-end routing
- TokenizedGenerateReqInput.lora_id: integer resolved at tokenize time from GenerateReqInput.lora_path via InputProcessor._resolve_lora_id().
- make_spec / RequestSpec.lora_id: scheduler receives per-request adapter id.
- EventLoop: init_lora_manager(), load_lora_adapter(), unload_lora_adapter(); _request_lora_ids dict tracks rid→lora_id for active requests.
- RequestHandler: LoadLoraReqInput / UnloadLoraReqInput dispatch via callbacks.
- scheduler_control_client: load_lora_communicator / unload_lora_communicator + async load/unload methods on AsyncLLM.
- Engine.load_lora_adapter / unload_lora_adapter: delegate to tokenizer_manager.
Tested
- PEFT reference on GPU 2: adapter_0 (argon) produces the memorized password (Kx7#mP2$-VORTEX93qR-alpha!Z ≈ expected Kx7#mP2$-VORTEX-93qR-alpha!Z).
- tokenspeed serve --enable-lora starts cleanly on GPU 4,5 and serves requests.
- Base model correctly ignores adapters when lora_path is not set.
TODO (PR #2)
- Route lora_path from OpenAI /v1/completions HTTP body through to lora_id.
- Full integration test driving greedy output parity with PEFT.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
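
The slot bookkeeping behind prepare_loras can be sketched without any GPU code. This is a hedged illustration only: `ToySlotPool` and its methods are hypothetical, weight copies are omitted, and the sketch assumes a single batch never references more distinct adapters than there are slots. What it keeps from the commit message is slot 0 for the base model, on-demand slot assignment, and LRU eviction that skips pinned adapters.

```python
from collections import OrderedDict


class ToySlotPool:
    """Assigns adapters to a fixed number of GPU slots with LRU eviction."""

    def __init__(self, max_loras):
        # Slot 0 is reserved for the base model, so adapters use 1..max_loras.
        self.free_slots = list(range(max_loras, 0, -1))
        self.slot_of = OrderedDict()  # adapter id -> slot, least recently used first
        self.pinned = set()

    def pin(self, lora_id):
        self.pinned.add(lora_id)

    def _evict_one(self):
        # Pick the least recently used adapter that is not pinned.
        victim = next((i for i in self.slot_of if i not in self.pinned), None)
        if victim is None:
            raise RuntimeError("all resident adapters are pinned")
        return self.slot_of.pop(victim)

    def prepare(self, lora_ids):
        """Return one slot index per request; 0 means base model."""
        indices = []
        for lora_id in lora_ids:
            if lora_id == 0:
                indices.append(0)
                continue
            if lora_id in self.slot_of:
                self.slot_of.move_to_end(lora_id)  # mark as most recently used
            else:
                slot = self.free_slots.pop() if self.free_slots else self._evict_one()
                self.slot_of[lora_id] = slot  # a real manager would copy weights here
            indices.append(self.slot_of[lora_id])
        return indices


pool = ToySlotPool(max_loras=2)
first = pool.prepare([1, 0, 2])  # adapters 1 and 2 take the two free slots
second = pool.prepare([1])       # adapter 1 becomes most recently used
third = pool.prepare([3])        # pool is full, so adapter 2 (LRU) is evicted
```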

Three fixes needed to run in eager mode (enforce_eager=True and disable_pdl=True, both auto-set when --enable-lora is used):
1. server_args: auto-set disable_pdl=True when enable_lora is set. The TVM-JIT rmsnorm_cute kernel used by the PDL path is JIT-compiled on first call with a fixed dtype; in eager mode the dtype may differ from the CUDA-graph warmup call, causing a Mismatched Tensor error.
2. lora_manager: cast scale to the delta tensor's dtype before multiplying. bfloat16_delta * float32_scale promoted the result to float32, which the rope kernel cannot handle (DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP16 failure). Fix: (delta * scale.to(delta.dtype)).
3. qwen3.py: replace _apply_qk_norm kernel calls with a pure-PyTorch RMSNorm implementation (_rms_norm static method). The flashinfer rmsnorm_cute kernel is JIT-compiled and its cached dtype cannot be changed at runtime; a simple x / rms * weight path avoids the kernel entirely and works with any dtype.
Also adds benchmark/test_lora_dynamic.py, an end-to-end test demonstrating dynamic load/unload of two adapters while the engine is live.
Confirmed:
- load_lora_adapter() / unload_lora_adapter() work at runtime
- LoRA weights ARE applied (different token IDs at generation position 7+ vs base model: base → "The password is", argon adapter → "1789...")
- Prefix cache namespacing correct (different slots, isolated)
Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…completions
Exposes lora_path in the OpenAI-compatible HTTP API so clients can select a LoRA adapter per request without any server restart.
protocol.py
- CompletionRequest.lora_path: str | None = None
- ChatCompletionRequest.lora_path: str | None = None
serving_completions.py / serving_chat.py
- Pass request.lora_path to GenerateReqInput so it flows through InputProcessor._resolve_lora_id() → lora_id → scheduler routing.
Usage example:
    curl http://localhost:8000/v1/completions \
      -d '{"model":"Qwen/Qwen3-8B","prompt":"...", "lora_path":"argon","max_tokens":30}'
model_executor.py
- Fix per-token weight_indices expansion for mixed-adapter batches: repeat_interleave(w_idx, input_lengths) so every token in a prefill batch gets its request's correct adapter slot index, not just the first N requests' indices sliced to total_tokens.
lora_manager.py
- Remove the broken per-token expansion from apply_qkv_lora/apply_o_lora; weight_indices is now always already per-token when it arrives. Single-request broadcast (1→tokens) is preserved.
benchmark/test_lora_batch.py
- New test: load argon + bastion, verify each produces different token IDs from base model and from each other (adapter isolation proof).
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
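
The per-token expansion fix can be shown in plain Python. The sketch below mimics what `repeat_interleave(w_idx, input_lengths)` does for a flattened prefill batch; `expand_per_token` and `sliced_indices` are hypothetical helpers, and `sliced_indices` is only an approximation of the broken behaviour the commit describes (taking the per-request indices as if they were already per-token).

```python
def expand_per_token(slot_per_request, input_lengths):
    """Repeat each request's slot index once per token of that request."""
    if len(slot_per_request) != len(input_lengths):
        raise ValueError("need one slot index and one length per request")
    per_token = []
    for slot, length in zip(slot_per_request, input_lengths):
        per_token.extend([slot] * length)
    return per_token


def sliced_indices(slot_per_request, input_lengths):
    """Approximation of the old behaviour: slice instead of repeat."""
    return slot_per_request[:sum(input_lengths)]


slots = [1, 0, 2]    # adapter slot per request (0 = base model)
lengths = [3, 1, 2]  # prompt tokens per request in the flattened batch
correct = expand_per_token(slots, lengths)
wrong = sliced_indices(slots, lengths)
```

With three requests of lengths 3, 1, and 2, the expansion yields six indices, one per token, while the sliced variant yields only three and assigns the second token of the first request to the base model.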

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Three correctness/cleanliness fixes to the virtual-root-per-adapter design: 1. Add KVPrefixCache::EvictLoraNamespace(lora_id): DFS-collects all descendant nodes, calls ResourceManager::EvictSubtree() to detach device/host pages (RAII auto-returns them to the allocator), then removes the virtual root via RemoveChild (unique_ptr cascade destroys the subtree including any mamba slots). Exposed as Scheduler::EvictLoraNamespace and bound to Python as scheduler.evict_lora_namespace(lora_id). Called from event_loop.unload_lora_adapter() so pages are freed immediately on unload rather than waiting for LRU pressure. 2. Remove device_.UpdateLeaves(raw) from getOrCreateLoraRoot: the call was a no-op (IsLeaf returns false for the empty-resource virtual root, and updateLeaf(real_root) returns immediately on IsRoot check). 3. Add EvictLoraNamespaceFreesPagesImmediately and EvictLoraNamespaceIdempotent tests. All 122 C++ tests pass. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Replace the per-token bmm LoRA path with sglang/Punica-style segmented Triton kernels (sgemm_lora_a / sgemm_lora_b / qkv_lora_b) and refactor LoraManager around a persistent LoraBatchInfo so the captured CUDA graph can replay against stable buffer pointers. * Move LoraManager creation into ModelExecutor.__init__ so graphs are captured with the LoRA path baked in (slot 0 = no-adapter, zero-delta via rank-0 short-circuit in the kernels). * Bind ctx.lora_manager during _capture_one and pre-fill batch_info with one segment per "request" so all LoRA kernels are recorded. * qwen3 attention now calls apply_qkv_lora / apply_o_lora with just (hidden, qkv, layer_id) — the manager owns batch_info. * Drop the auto-disable of cuda graphs when --enable-lora is set. * Single-GPU Qwen3-8B (TP=1, bs=1, 256 decode tokens, H100): eager+LoRA 36.7 → graph+LoRA 105.5 tok/s (2.87x). Also threads lora_path through Engine.generate so the in-process Engine API matches the HTTP routing that already lands lora_path in GenerateReqInput. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…ving

Commit 126164b reintroduced a manual fp32 RMSNorm in ``_apply_qk_norm`` to dodge a JIT-dtype mismatch in the rmsnorm_cute (PDL) kernel under ``--enable-lora``. Server args already auto-set ``disable_pdl=True`` for that path, so the regular flashinfer ``rmsnorm`` (used by input_layernorm / post_attention_layernorm) is correct here too. Restoring the fused kernel collapses ~7 small launches per call into one. Single-GPU Qwen3-8B (TP=1, bs=1, 256 decode tokens, H100): * eager + base: 47.7 → 57.4 tok/s (+20%) * graph + base: 122.8 → 142.0 tok/s (+16%) * graph + LoRA: 105.5 → 118.8 tok/s (+13%) Profile (eager): qk_norm dropped from 138 us / layer to 39 us / layer (36 layers, 4.97 ms → 1.40 ms per decode step). Aligns this branch with main, which already restored the fused path. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

When --enable-lora is on but no request in the current batch uses an adapter, the captured CUDA graph still includes all the per-layer Triton LoRA kernels. The rank-0 short-circuit returns early, but each kernel still costs its replay-time launch slot (about 5% per step).
Capture two graphs per batch size:
* graphs[bs]: with-LoRA; ctx.lora_manager set, Triton calls baked in.
* graphs_no_lora[bs]: same forward without the LoRA path.
LoraManager.prepare_loras updates a CPU-side has_active_lora flag from the resolved per-request slots; the wrapper reads it before each replay to pick the right variant. Mixed batches (any segment with rank > 0) fall back to the with-LoRA graph as before.
Single-GPU Qwen3-8B (TP=1, bs=1, 256 decode tokens, H100):
* graph + no --enable-lora : 142.0 tok/s
* graph + --enable-lora, no adapter : 134.5 → 138.4 tok/s
* graph + --enable-lora, active adapter : 119.1 tok/s (unchanged)
Tradeoffs: 2× capture time at startup (~10s → ~20s); marginal extra graph memory (the activations pool is shared via global_graph_memory_pool).
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
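
The variant selection described above amounts to a small host-side check before each replay. A minimal sketch, assuming slot 0 means "no adapter"; `pick_graph` is a hypothetical helper and the dictionaries of strings stand in for captured CUDA graphs.

```python
def pick_graph(graphs, graphs_no_lora, bs, slots):
    """Pick the captured graph variant for one replay.

    graphs / graphs_no_lora: dicts keyed by batch size (stand-ins for graphs).
    slots: resolved per-request adapter slots, where slot 0 means no adapter.
    """
    has_active_lora = any(slot != 0 for slot in slots)
    table = graphs if has_active_lora else graphs_no_lora
    return table[bs]


graphs = {1: "lora-bs1", 2: "lora-bs2"}
graphs_no_lora = {1: "base-bs1", 2: "base-bs2"}
base_only = pick_graph(graphs, graphs_no_lora, 2, [0, 0])  # no adapter in the batch
mixed = pick_graph(graphs, graphs_no_lora, 2, [0, 3])      # one request uses an adapter
```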

Extends LoRA to the MLP block of qwen3 in addition to attention.
Triton kernels:
* New gate_up_lora_b: fused 2-projection B expand for the stacked gate/up MLP linear (analogous to qkv_lora_b for attention).
* Reuses sgemm_lora_a (stack_num=2 for gate_up, 1 for down) and sgemm_lora_b (for down's full output expand).
LoraManager:
* _parse_adapter_weights now matches mlp.{gate,up,down}_proj keys.
* New per-layer buffers gate_up_A/B and down_A/B; un-sharded because qwen3 Qwen3MLP runs MergedColumnParallelLinear / RowParallelLinear with tp_size=1 (each rank holds the full intermediate weight).
* New apply_gate_up_lora and apply_down_lora: gate_up reuses the fused-B path; down has no internal all-reduce because there's no TP.
Bug fix (also affected attention):
* The sgemm_lora_a kernel only writes the first ``rank * stack_num`` output cols, and qkv_lora_b / gate_up_lora_b read with stride ``stack_idx * actual_rank`` (after the kernel's K=min(K,rank) cap). _load_to_slot was packing stacks at multiples of MAX rank, which fell outside what the kernels actually read, silently zeroing the k/v deltas (and now would zero up's delta too). Now packs stacks contiguously at ``stack_idx * actual_rank``, matching what sglang's weight loader does (mem_pool.py L873 ``[:lora_rank * c, :]``).
Qwen3MLP gains a layer_id and the forward call now threads through ``ctx`` so the LoRA hooks can be invoked.
E2E correctness on togethercomputer/Qwen3-8B-LoRA-Password-Adapters (Qwen3-8B, TP=1, bs=1, H100):
* attn adapter: ' No other text.\nX7#mP2$VORTEX93qR\n...' (PEFT ref: 'Zx7#mP2$-VORTEX93qR\nNext, please ...')
* mlp adapter: ' 73\nKx7#mP2$-VORTEX-93qR\nKx7#mP2$' (PEFT ref: ' 73\nKx7#mP2$-VORTEX-93qR\nKx7#mP2$-...'), a bit-for-bit match for the first ~30 tokens.
Throughput (256 decode tokens):
* graph + base : 142.0 tok/s
* graph + attn LoRA (q/k/v/o) : 119.1 tok/s (post-stack-fix; was only-q before, so this is the *correct* number)
* graph + mlp LoRA (gate/up/down): 97.5 tok/s
* sglang/tgl mlp LoRA: crashes with cudaErrorIllegalAddress on both csgmv and triton backends.
Memory: MLP buffers add ~672 MB at ``max_loras=2`` for Qwen3-8B (intermediate=12288, hidden=4096, max_rank=64).
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
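
The stack-packing bug is easiest to see with a flat list standing in for the rows of the A buffer. In this hedged sketch, `pack` and `kernel_read` are hypothetical helpers, each integer stands in for one rank row, and the sizes are made up; the only parts taken from the commit message are the two packing strides (MAX rank versus actual rank) and the kernel's read offset `stack_idx * actual_rank`.

```python
def pack(stacks, stride, max_rank):
    """Write each stack's rows into a zeroed buffer at stack_idx * stride."""
    buf = [0] * (len(stacks) * max_rank)
    for stack_idx, rows in enumerate(stacks):
        start = stack_idx * stride
        buf[start:start + len(rows)] = rows
    return buf


def kernel_read(buf, stack_idx, actual_rank):
    """Read one stack the way the expand kernels index it."""
    start = stack_idx * actual_rank
    return buf[start:start + actual_rank]


max_rank, actual_rank = 8, 2
stacks = [[11, 12], [21, 22], [31, 32]]  # stand-ins for the q, k, v rank rows

old_buf = pack(stacks, stride=max_rank, max_rank=max_rank)     # stacks at 0, 8, 16
new_buf = pack(stacks, stride=actual_rank, max_rank=max_rank)  # stacks at 0, 2, 4

old_reads = [kernel_read(old_buf, i, actual_rank) for i in range(3)]
new_reads = [kernel_read(new_buf, i, actual_rank) for i in range(3)]
```

With the old stride only the first stack lands where the kernel looks, which is consistent with the "was only-q before" note in the throughput numbers.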

Batched ``engine.generate(prompt=[...], lora_path=[...])`` is split per index by ``async_llm._handle_batch_request`` via ``obj[i]``. The ``__getitem__`` method built the per-request sub-object but dropped ``lora_path``, so every sub-request ran as base model regardless of which adapter the caller asked for. Mixed-batch test on togethercomputer/Qwen3-8B-LoRA-Password-Adapters (4 adapters + 1 base prompt in a single ``generate`` call): * before: 1/5 — only the base-model row passed; all four adapter rows produced base-model output. * after: 4/5 — three adapter rows emit their project's password fragment, base row correctly does not. The remaining failure is a flaky adapter (bastion is just noisy under greedy decode — same behavior in isolation), not a routing bug. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
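
The routing bug comes down to what `__getitem__` copies into each sub-request. A minimal sketch with a hypothetical `ToyGenerateReq` dataclass (the real GenerateReqInput has many more fields, and its handling of scalar versus list values is an assumption here):

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ToyGenerateReq:
    prompt: Union[str, List[str]]
    lora_path: Union[None, str, List[Optional[str]]] = None

    def __getitem__(self, i):
        # A list-valued lora_path is indexed per request; a scalar (or None)
        # applies to every sub-request. Dropping this field here would make
        # every sub-request run as the base model.
        if isinstance(self.lora_path, list):
            lora = self.lora_path[i]
        else:
            lora = self.lora_path
        return ToyGenerateReq(prompt=self.prompt[i], lora_path=lora)


batch = ToyGenerateReq(prompt=["a", "b", "c"], lora_path=["argon", None, "bastion"])
subs = [batch[i] for i in range(3)]
```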

Adds a CPU pinned-memory tier between the GPU LoRA buffers and the adapter's disk path. Adapters now flow:
    disk (always) → CPU pool (max_loras_cpu) → GPU pool (max_loras)
* CPU pool is bounded; LRU eviction drops the cached parsed weights and relies on _adapter_paths[name] to reload on next use. The disk path is the source of truth and is assumed durable (S3 backing is a natural future replacement).
* Pinned adapters (passed `pinned=True` at load time) are protected from CPU eviction; non-pinned GPU-resident adapters can be CPU-evicted when the pool is otherwise full (their weights are still on GPU; a future GPU re-promotion costs a disk read). Eviction prefers non-GPU-resident candidates first.
* Async prefetch hooks request admission: when a request with ``lora_id != 0`` is admitted, the manager kicks off a disk read on a ThreadPoolExecutor so the safetensors I/O is overlapped with the previous forward step instead of blocking ``prepare_loras`` of the step that consumes it. prepare_loras joins an in-flight prefetch instead of double-reading. Toggle with ``TOKENSPEED_LORA_PREFETCH=0``.
* New server args:
    --max-loras-cpu: default 4 × max_loras
    --lora-scheduling-policy: {lru} for now; the dispatch point stays in event_loop for future 'admission' / 'pack' policies.
* Validation: max_loras_cpu must be ≥ max_loras (every GPU-resident adapter is also tracked in the CPU LRU; if max_loras_cpu == max_loras the policy-2 step lets us evict GPU-resident adapters from CPU when needed, instead of locking the pool).
E2E test (Qwen3-8B, max_loras=2, max_loras_cpu=2, three adapters sequenced so the first is CPU-evicted then re-requested):
* 1st argon: ' Kx7#mP2$-VORTEX93qR' → PASS (initial)
* 1st citadel: 'Tf3!hR6^-PRISM-27bK' → PASS
* dagger: HELIX-fragments → noisy under greedy decode
* 2nd argon (after CPU eviction + disk reload): ' Zx7#mP2$-VORTEX93qR' → PASS, matches the PEFT reference.
29 unit tests pass (incl. 8 new tests covering CPU LRU, disk reload, pinned protection, prefetch path, and unload tear-down).
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
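
The CPU-tier eviction order described above (pinned adapters protected, non-GPU-resident candidates preferred) can be sketched as a single selection function. `choose_cpu_victim` and its arguments are hypothetical; this shows only the ordering rule, not the actual pool code.

```python
def choose_cpu_victim(cpu_lru, pinned, gpu_resident):
    """Pick the adapter to drop from the CPU pool, or None if none is evictable.

    cpu_lru: adapter names ordered least recently used first.
    pinned: names protected from CPU eviction.
    gpu_resident: names whose weights are currently in a GPU slot.
    """
    candidates = [name for name in cpu_lru if name not in pinned]
    # First choice: the least recently used adapter that is not on the GPU.
    for name in candidates:
        if name not in gpu_resident:
            return name
    # Fallback: a GPU-resident adapter may still lose its CPU copy.
    return candidates[0] if candidates else None


victim_a = choose_cpu_victim(["argon", "bastion", "citadel"], {"argon"}, {"bastion"})
victim_b = choose_cpu_victim(["argon", "bastion"], set(), {"argon", "bastion"})
victim_c = choose_cpu_victim(["argon"], {"argon"}, set())
```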

Adds the ``pack`` lora scheduling policy and a benchmark that characterises the cost of each residence tier so users can size ``--max-loras-cpu`` for their workload.
Benchmark (Qwen3-8B, TP=1, max_loras=2, max_loras_cpu=3, max_lora_rank=64, H100 80GB, 1-token decode):
    warm:          ~43 ms
    cpu-resident:  ~43 ms (CPU→GPU copy is <1 ms, lost in the forward)
    cold (disk):   ~72 ms (~30 ms safetensors read + parse)
Findings:
* CPU promotion is essentially free, so once an adapter is in the CPU pool there is no measurable per-request cost. Sizing ``max_loras_cpu`` to cover the working set eliminates the cold-disk hit entirely.
* Async prefetch only matters under multi-request concurrency: in serial single-request mode the prefetch's disk read still blocks the consuming request's prepare_loras.
``pack`` policy: in ``_process_new_requests`` the admitted-spec list is stable-sorted by lora_id when ``--lora-scheduling-policy=pack``, so adapter-shared requests cluster at the C++ scheduler. Reduces GPU/CPU eviction churn when ``working_set > max_loras_cpu`` and traffic is bursty enough to put multiple cold requests in one event-loop iter. ``lru`` (default) keeps arrival order.
Skipped the ``admission`` policy: the benchmark shows GPU promotion is free, so gating batches that don't fit in GPU buys nothing. The only real eviction cost is CPU→disk, and that is already controlled by ``max_loras_cpu``.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
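
The ``pack`` ordering can be sketched with Python's built-in stable sort. `order_admitted` and the dict-shaped specs are hypothetical stand-ins for the admitted-spec list; the point is only that a stable sort by lora_id clusters requests per adapter while preserving arrival order within each adapter.

```python
def order_admitted(specs, policy="lru"):
    """Order newly admitted request specs according to the scheduling policy."""
    if policy == "pack":
        # sorted() is stable, so requests sharing a lora_id keep arrival order.
        return sorted(specs, key=lambda spec: spec["lora_id"])
    if policy == "lru":
        return list(specs)  # keep arrival order
    raise ValueError(f"unknown policy: {policy}")


arrivals = [
    {"rid": "r0", "lora_id": 2},
    {"rid": "r1", "lora_id": 0},
    {"rid": "r2", "lora_id": 2},
    {"rid": "r3", "lora_id": 1},
    {"rid": "r4", "lora_id": 0},
]
packed = [spec["rid"] for spec in order_admitted(arrivals, "pack")]
unpacked = [spec["rid"] for spec in order_admitted(arrivals, "lru")]
```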

…erving # Conflicts: # python/tokenspeed/runtime/execution/model_executor.py # python/tokenspeed/runtime/models/qwen3.py # tokenspeed-scheduler/CMakeLists.txt # tokenspeed-scheduler/bindings/python_module.cpp # tokenspeed-scheduler/csrc/fsm/forward_events.cpp

…erving Signed-off-by: Qingyang Wu <qingyang@together.ai> # Conflicts: # python/tokenspeed/runtime/engine/io_struct.py # python/tokenspeed/runtime/entrypoints/openai/protocol.py # python/tokenspeed/runtime/entrypoints/openai/serving_chat.py # python/tokenspeed/runtime/entrypoints/openai/serving_completions.py # tokenspeed-scheduler/CMakeLists.txt # tokenspeed-scheduler/csrc/resource/kv_prefix_cache/kv_prefix_cache.cpp # tokenspeed-scheduler/csrc/resource/kv_prefix_cache/kv_prefix_cache.h

EvictSubtree referenced the old `leaves_` set removed by lightseekorg#18; switch to the timestamp-keyed lru_leaves_/node_time_ cleanup used by updateLeaf so the scheduler core compiles again and pip's editable build of tokenspeed-scheduler succeeds. Also apply clang-format 18.1.3 to files touched by the LoRA merge so the lint job passes. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…erving Resolved conflicts in KV/Hybrid prefix cache Match signatures by composing both new params: lora_id (this branch, per-adapter namespacing) and intent (main, distinguishes PrefixReuse from StateRecovery for retracted-request recovery). Both call sites in forward.cpp (scheduleDecodeFromRetracted and the post-allocation re-match) now pass request->LoraId() together with MatchIntent::StateRecovery so retracted LoRA requests recover from their own adapter namespace. Also merged ForwardContext: kept the new last_index_offsets field from main alongside the lora_manager field on this branch. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Per AGENTS.md the runtime should only cross the kernel boundary through tokenspeed-kernel, and Triton imports should funnel through _triton.py. Relocates the segment-grouped LoRA kernels from python/tokenspeed/runtime/lora/triton_ops/ to tokenspeed-kernel/python/tokenspeed_kernel/ops/gemm/lora_triton/ and swaps the `import triton` lines for `from tokenspeed_kernel._triton`. LoraManager now imports its kernels from the kernel package. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Two TP-correctness fixes uncovered when verifying the Qwen3-8B-LoRA-Password-Adapters e2e suite at attn_tp_size=2. 1. Qwen3MLP is now TP-aware (gate_up_proj column-parallel, down_proj row-parallel; see runtime/models/qwen3.py). The LoRA buffers and slice offsets assumed the un-sharded layout, causing a shape mismatch in sgemm_lora_a during CUDA-graph capture and incorrect adapter semantics if the assert had not fired. The fix introduces intermediate_per_tp and: - sizes gate_up_B_buffers to (2 * intermediate_per_tp, r) per slot, - sizes down_A_buffers to (r, intermediate_per_tp) per slot, - passes intermediate_per_tp to gate_up_lora_b_fwd (the kernel already expected the per-rank output dim), - extends _shard_weights to slice MLP B (gate/up, column) and MLP A (down, row) the same way attention modules already were. 2. apply_o_lora previously computed the *full* B @ A @ x by all-reducing lora_a internally, then added that full delta to a partial base output. The host's downstream all-reduce in post_attention_layernorm then summed the delta tp_size times — pre-existing bug acknowledged in the old docstring, manifesting as garbled output for any attention adapter at TP > 1. Drop the internal all-reduce so each rank emits a partial (B @ A_local @ x_local) and rely on the existing downstream all-reduce to sum partials correctly; comm_all_reduce import is no longer needed. Verified e2e against Qwen3-8B with attention and MLP adapters from togethercomputer/Qwen3-8B-LoRA-Password-Adapters at attn_tp_size=2: both modes produce the exact target passwords; base model does not leak the secret; same-adapter re-queries after a different adapter is loaded still resolve through the right namespace. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
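
The double-counting in the old apply_o_lora can be reproduced with scalars. This is an arithmetic sketch only: `all_reduce` is a toy stand-in for the collective, each list entry is one rank's value, and the numbers are made up. The old path all-reduces the delta and then lets the downstream all-reduce sum it again; the new path adds only the local partial delta.

```python
def all_reduce(values):
    """Toy sum all-reduce: every rank ends up with the total."""
    total = sum(values)
    return [total] * len(values)


tp_size = 2
base_partial = [1.0, 2.0]    # each rank's partial o_proj output
delta_partial = [0.25, 0.5]  # each rank's partial LoRA delta (B @ A_local @ x_local)

expected = sum(base_partial) + sum(delta_partial)

# Old behaviour: all-reduce the delta internally, add the full delta to the
# partial base output, then the downstream all-reduce sums it tp_size times.
full_delta = all_reduce(delta_partial)
old_out = all_reduce([b + d for b, d in zip(base_partial, full_delta)])[0]

# New behaviour: add only the local partial delta and rely on the existing
# downstream all-reduce to sum the partials once.
new_out = all_reduce([b + d for b, d in zip(base_partial, delta_partial)])[0]
```

Here the old output exceeds the expected value by `(tp_size - 1)` times the full delta, while the new output matches it.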

Adds ``@triton.autotune`` to all four LoRA kernels (``sgemm_lora_a``, ``sgemm_lora_b``, ``qkv_lora_b``, ``gate_up_lora_b``), keyed on the (output_dim, K) shape pair that drives tile selection. The candidate config sweep matches the space sglang found productive in sgl-project/sglang#20391 (shrink: BLOCK_N×BLOCK_K×warps×stages; expand: adds maxnreg for occupancy) plus a BLOCK_S axis since our kernel exposes it.
Picks survive process restarts via ``configs/<gpu>/<kernel>.json`` checked into the package: on import ``load_kernel_cache`` populates ``Autotuner.cache`` so production never pays the sweep cost. The ``tune.py`` driver runs each kernel with decode-shaped batches (``bs=32, max_len=1``) for the Qwen3-8B shapes at attn_tp_size=2 and writes the JSON; re-run it on a new GPU or model to extend the cache.
Bench on the lora_active config (Qwen3-8B, attn_tp=2, 32 prompts × 128 out tokens, password adapter on every request):
    config                             tok/s   ms/req
    base                               5517    23.2
    --enable-lora, no lora_path        5210    24.6
    --enable-lora, lora_path (orig)    3201    40.0
    --enable-lora, lora_path (tuned)   3279    39.0   (+2.4%)
A modest win: the workload is decode-dominated (bs=32 single-token segments), where launch overhead and per-step ``prepare_loras`` work dwarf the block-size choice for these small matmuls. Tuning at prefill-shaped batches (bs=4, max_len=32) regressed by ~5%, confirming that the block sizes are decode-vs-prefill sensitive; the committed configs target decode. Larger wins are still possible against the non-kernel parts of the LoRA path (per-step host work, kernel launch count) but those are out of scope here.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>

``sgemm_lora_a``/``sgemm_lora_b`` was misleading on two axes — ``sgemm`` is BLAS for "single-precision (fp32) GEMM" (our kernel is bf16/fp16), and ``_a``/``_b`` is PEFT terminology that's only obvious to LoRA specialists. Replace with operation-name files that read at first glance: sgemm_lora_a.py -> lora_shrink.py (in_dim -> r) sgemm_lora_b.py -> lora_expand.py (r -> out_dim) qkv_lora_b.py -> lora_qkv_expand.py (fused QKV expand) gate_up_lora_b.py -> lora_gate_up_expand.py (fused gate/up expand) Public ``*_fwd`` functions, internal ``_*_kernel`` symbols, and the per-GPU autotune JSON config filenames follow the same scheme. The PEFT-style attribute names inside ``lora_manager.py`` (``qkv_A_buffers``, ``o_B_buffers``, etc.) and the tensor-parameter names in the kernel signatures (``qkv_lora_b``, ``gate_up_lora_b``) stay — those legitimately reference the PEFT ``lora_A``/``lora_B`` decomposition, not the operation. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

LoRA isn't really a GEMM variant — it's its own op family that happens to use segmented matmuls under the hood. Hosting the kernels under ``ops/gemm/lora_triton/`` overloaded the gemm family with LoRA-specific buffers, batch_info, and Triton helpers. Promote LoRA to a top-level family that follows the ``<family>/<solution>`` convention already used by ``ops/attention/triton/``: ops/gemm/lora_triton/ → ops/lora/triton/ The kernel files, autotune configs, ``tuning.py`` cache loader, and ``tune.py`` driver all move together; only the import path changes. ``lora_manager.py`` in the runtime is updated to import from the new location. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

The four LoRA Triton kernels (and ``kernel_utils.py``) were adapted from sglang's ``python/sglang/srt/lora/triton_ops/`` (Apache-2.0), which in turn descends from the Punica S-LoRA design. Add file-level provenance notes — upstream path, URL, license — and a package-level pointer in ``__init__.py``. No code changes; attribution only. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Follow-up to the ops/lora/triton/ restructure — update the runtime manager to import from the new location instead of ops/gemm/lora_triton. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Add chunked_sgmv_expand_fwd — a unified LoRA-B expand kernel that covers plain, QKV, and gate/up projections via a NUM_SLICES constexpr and a slice_offsets boundary tensor. Making OUTPUT_DIM, MAX_RANK, NUM_SLICES, and all strides constexpr lets the compiler specialise the K-loop trip count at compile time, giving 2–3× speedup at prefill with rank ≥ 64 vs the runtime-stride decode kernels. lora_manager dispatches on batch_info.max_len > 32: decode steps always use the existing tuned kernels (11–25 µs); prefill uses chunked_sgmv. Slice-offset tensors for each projection type are pre-allocated in __init__ so dispatch adds zero per-step overhead, and the captured decode CUDA graph is unaffected (max_len = 1 is always below the threshold). Benchmarked on H100 at Qwen3-8B TP=2 shapes: prefill s=512 rank=64 QKV expand: 62 µs → 19 µs (3.3×) prefill s=512 rank=64 gate/up: 110 µs → 35 µs (3.1×) decode s=1 rank=64 (unchanged): 34 µs Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Add lora_expand_grouped_v2_fwd: adapts vLLM's token-sorted dispatch pattern (grid axis-1 = num_active_adapters) to eliminate the gather/scatter overhead of lora_expand_decode_fwd.
Key design:
• x and output accessed at scattered original token positions via token_indices; no pre-gather or post-scatter needed
• Grid: (cdiv(M, BLOCK_S) × cdiv(N, BLOCK_N), num_groups), which tiles both M and N, matching vLLM's parallelism structure
• CTAs beyond a group's token count exit immediately (same early-exit as vLLM's lora_expand_kernel)
• Constexpr strides + tl.multiple_of EVEN_K from our prior work
Benchmarked vs vLLM inline + old grouped kernel (rank=64, N=4096, H100):
    n     n_unique   grpv2     vllm      seg       grpv2 vs vllm
    32    4          9.8 µs    11.3 µs   22.2 µs   +12%
    64    4          10.4 µs   12.1 µs   36.2 µs   +14%
    128   4          12.7 µs   13.2 µs   63.8 µs   +4%
    128   1          11.0 µs   11.0 µs   62.9 µs   tied
grpv2 wins in the common n_unique ≤ n/4 regime; vllm wins marginally at extreme n_unique=n (all unique) corner cases, which the existing dispatch threshold (bs // num_groups >= 8) already routes to segmented.
Replaces lora_expand_decode_fwd at both dispatch sites in lora_manager.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…iscompute

When the autotuner benchmarks BLOCK_K=64 for MAX_RANK=16, the original K // BLOCK_K = 0 caused zero loop iterations and a silent no-op (correct base_output returned but LoRA delta omitted). The autotune then picked this config as 'fastest' since it did nothing.

Fix: revert K // BLOCK_K -> tl.cdiv(K, BLOCK_K) and restore k_rem masks so all BLOCK_K configs produce correct results. Configs with BLOCK_K > K are now slower (one masked iteration) and the autotuner naturally avoids them in favour of BLOCK_K <= rank configs.

Verified: 176/176 correctness checks pass across n in {1..128}, n_unique in {1..n}, rank in {16,32,64,128}, N in {4096,8192}.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
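
The failure mode described here is easy to reproduce in plain Python. The sketch below is not the Triton kernel; it is a scalar stand-in, assuming a K-loop tiled by BLOCK_K with a bounds mask, and the function names are hypothetical.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the integer analogue of tl.cdiv."""
    return (a + b - 1) // b


def tiled_dot(x: list[float], w: list[float], block_k: int, use_cdiv: bool) -> float:
    """Dot product over K in BLOCK_K-sized tiles, with a mask on the tail."""
    K = len(x)
    trips = cdiv(K, block_k) if use_cdiv else K // block_k
    acc = 0.0
    for t in range(trips):
        for j in range(block_k):
            k = t * block_k + j
            if k < K:  # plays the role of the k_rem mask
                acc += x[k] * w[k]
    return acc


rank = 16
x = [1.0] * rank
w = [2.0] * rank

# Floor division gives zero trips when BLOCK_K > K, so the delta is dropped.
print(tiled_dot(x, w, block_k=64, use_cdiv=False))
# Ceiling division runs one masked trip and produces the full sum.
print(tiled_dot(x, w, block_k=64, use_cdiv=True))
```

A zero-trip loop is also the cheapest possible config, which is why a timing-only autotuner would prefer it.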

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

# Conflicts:
#   python/tokenspeed/runtime/models/qwen3.py

Summary of changes in this commit:

lora_expand_grouped_v2.py (correctness fix):
  Restore tl.cdiv(K, BLOCK_K) + k-masks in place of K // BLOCK_K, preventing the autotuner from selecting BLOCK_K > rank configs, which silently produced zero-delta outputs. Verified 176/176 correctness checks pass across n ∈ {1..128}, n_unique ∈ {1..n}, rank ∈ {16,32,64,128}, N ∈ {4096,8192}.

lora_manager.py:
  Switch o_proj and down_proj decode dispatch from lora_expand_decode_fwd (gather/scatter) to lora_expand_grouped_v2_fwd (scattered reads, no copy). Add adapter-group metadata (sort_order, group_slots, group_starts, group_sizes, num_groups) to prepare_loras for the new kernel.

lora_expand.py / lora_qkv_expand.py / lora_gate_up_expand.py:
  Add BLOCK_K ∈ {64, 128} to expand config spaces (profiling showed 0% BW utilisation, i.e. instruction-bound; BLOCK_K=64 eliminates the K-loop for rank=64 when combined with tl.cdiv).

bench_vs_vllm.py, profile_expand.py:
  Benchmark and profiling scripts comparing vs vLLM kernels.

End-to-end numbers (H100, rank=64):
  Decode n=32 expand grpv2 vs original: 11.2 µs (was 25.1 µs, 2.24×)
  Decode n=128 expand grpv2 vs original: 14.2 µs (was 63.0 µs, 4.45×)
  Prefill s=512 QKV expand vs original: 28.8 µs (was 61.0 µs, 2.12×)
  Prefill s=512 shrink vs original: 16.7 µs (was 23.4 µs, 1.40×)

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
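
One plausible reading of the adapter-group metadata named in this commit is sketched below. The field names come from the commit message, but their exact semantics are an assumption here: sort_order is taken to be a stable ordering of token positions by adapter slot, and each group is a contiguous run in that ordering. The function name build_group_metadata is hypothetical.

```python
def build_group_metadata(weight_indices: list[int]) -> dict:
    """Group token positions by adapter slot (CPU-side, no GPU sync needed)."""
    # Stable sort keeps tokens of the same adapter in their original order.
    sort_order = sorted(range(len(weight_indices)), key=lambda t: weight_indices[t])

    group_slots: list[int] = []   # adapter slot served by each group
    group_starts: list[int] = []  # offset of each group inside sort_order
    group_sizes: list[int] = []   # number of tokens in each group
    for pos, tok in enumerate(sort_order):
        slot = weight_indices[tok]
        if not group_slots or group_slots[-1] != slot:
            group_slots.append(slot)
            group_starts.append(pos)
            group_sizes.append(0)
        group_sizes[-1] += 1

    return {
        "sort_order": sort_order,
        "group_slots": group_slots,
        "group_starts": group_starts,
        "group_sizes": group_sizes,
        "num_groups": len(group_slots),
    }


# Six decode tokens spread over three adapter slots.
meta = build_group_metadata([2, 0, 2, 1, 0, 2])
print(meta)
```

With this layout a kernel can read token positions for group g from sort_order[group_starts[g] : group_starts[g] + group_sizes[g]] and access x and the output in place.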

…eshold needed

The grouped v2 kernel previously used M = batch_info.bs (total tokens) for the grid M dimension. For n_unique = n (all different adapters) this launched cdiv(n, BLOCK_S) × cdiv(N, BLOCK_N) × n CTAs with (BLOCK_S-1)/BLOCK_S wasted per group, making it 2-3× slower than segmented.

Fix: use M = max(group_sizes) (pre-computed on CPU, no GPU sync) instead of batch_info.bs. When every group has 1 token (seg-like case), max_group_size=1 → grid = (1 × cdiv(N,BLOCK_N), n), identical to the segmented layout with zero wasted CTAs.

The kernel now handles both extremes:
  n_unique = 1 (same adapter): max_gs=n → grpv2 layout, full M-tiling
  n_unique = n (all different): max_gs=1 → segmented layout, no waste
  n_unique = 4 (typical): max_gs=n/4 → compact 4× fewer CTAs

Removes the _TRITON_GROUPED_DECODE_MIN_GROUP_SIZE = 32 threshold (set to 1) since the kernel is now safe and optimal for all group sizes.

Results (rank=64, N=4096):
  n=128 n_uniq=128 (seg-like): grpv2≈seg 75.5µs vs 76.9µs (1.02×)
  n=128 n_uniq= 4 (typical): grpv2 wins 15.0µs vs 63.3µs (4.23×)
  n=128 n_uniq= 32: grpv2 wins 27.4µs vs 66.8µs (2.44×)

Also adds max_group_size: int to LoraBatchInfo and sets it in prepare_loras.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
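
The CTA counts behind this change can be worked out with a few lines of arithmetic. The grid formula is the one quoted in the commit; BLOCK_S=16 and BLOCK_N=64 are assumed tile sizes chosen only for illustration, and grid_ctas is a hypothetical helper.

```python
def cdiv(a: int, b: int) -> int:
    return (a + b - 1) // b


def grid_ctas(m_extent: int, N: int, num_groups: int,
              block_s: int = 16, block_n: int = 64) -> int:
    """Total CTAs for a (cdiv(M, BLOCK_S) * cdiv(N, BLOCK_N), num_groups) grid."""
    return cdiv(m_extent, block_s) * cdiv(N, block_n) * num_groups


n, N = 128, 4096
for n_unique in (1, 4, 128):
    max_group_size = n // n_unique  # evenly sized groups, for simplicity
    old = grid_ctas(n, N, n_unique)               # M = batch_info.bs
    new = grid_ctas(max_group_size, N, n_unique)  # M = max(group_sizes)
    print(f"n_unique={n_unique:3d}  old={old:6d}  new={new:6d}  ratio={old / new:.0f}x")
```

Under these assumed tile sizes the n_unique=4 case launches 4× fewer CTAs, in line with the commit's description, and the all-unique case shrinks by the full cdiv(n, BLOCK_S) factor.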

lora_expand_decode_fwd (gather/scatter grouped expand) was fully replaced by lora_expand_grouped_v2_fwd and is no longer referenced anywhere in production code. Remove the file and clean up the __init__ export and doc references.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…perf fixes

Core changes:
- adapter_io: parse PEFT lora_embedding_A/B keys for lm_head; add LORA_HEAD_LAYER_ID sentinel
- lora_buffers: add 'lm_head' buffer group (lm_head_A/B_buffer, vocab_per_tp dim); column-parallel TP sharding
- lora_manager: add apply_lm_head_lora (single-slot matmul fast path; bmm fallback for mixed slots); skip H2D copies + cumsum in prepare_loras when has_active_lora=False
- logits_processor: wire apply_lm_head_lora before TP all-gather in _get_logits
- moe/layer: raise NotImplementedError for non-Triton backends with active LoRA
- server_args: add 'lm_head' to valid lora_buffer_groups; remove stale disable_pdl=True override (PDL works correctly with LoRA)

Tests: test_qwen3_lora_password_adapters covers attn/mlp/lm_head adapter types under sequential, batched, high-concurrency, and mixed-batch scenarios (72 subtests)

Perf: n_active=0 cudagraph now matches baseline (1170 vs 1171 tok/s) after removing two wasted GPU ops per step and re-enabling PDL

Signed-off-by: Qingyang Wu <qingyang@together.ai>

…totune picks

- Add BLOCK_S=8 to _EXPAND_CONFIGS: decode batches have S=8 tokens/segment, so BLOCK_S=16 wastes half the tile. The autotuner now considers the decode-optimal tile size.
- Cache autotune picks for N=6144 (QKV expand, q+2kv=4096+1024+1024) and N=24576 (gate_up expand, 2×12288) on H100 80GB HBM3. Both shapes were previously missing, triggering a live 648-config sweep on every fresh process. New picks consistently use BLOCK_S=8 for decode workloads.

Signed-off-by: Qingyang Wu <qingyang@together.ai>

…H100

Add BLOCK_S=8 to _SHRINK_CONFIGS so the autotuner considers decode-batch tile sizes. Re-ran autotune for all 16 Qwen3-8B shapes (rank 16/32/64/128 × K=4096/12288, QKV/gate_up stacks).

Unlike the expand kernel, the shrink kernel is K-bandwidth-bound (large hidden_size read), so BLOCK_S=16 remains optimal: amortising the K-dimension read across more output rows wins. No config changes from re-tuning; cache updated to reflect the wider search.

Signed-off-by: Qingyang Wu <qingyang@together.ai>

…mark

- test_qwen3_moe_per_expert_lora_password_adapters: tests per_expert format (128 experts × 48 layers, 36864 weight tensors) with max_loras=2 due to ~1.96 GB/slot buffer footprint on Qwen3-30B-A3B. Covers sequential, 2-adapter batch, high-concurrency same-adapter, and mixed LoRA/base scenarios.
- bench_moe_lora_decode: compares per_expert vs sglang_shared formats across n_active=0/1/2/4/8 vs no-LoRA baseline. Results in 0521_moe_lora_results.md.

Key finding: both formats ~83 tok/s at n_active=1 (−63% vs 226 tok/s baseline); n_active>2 collapses to ~2 tok/s due to CPU↔GPU swap thrashing with max_loras=2.

Signed-off-by: Qingyang Wu <qingyang@together.ai>

…ton baseline

- bench_moe_lora_decode.py: parallel benchmark using mp.Process across 8 GPUs, batched 8-at-a-time. Tests per_expert vs sglang_shared formats, n_active=0/1/2, eager vs cudagraph, TP=1 vs TP=2, baseline with auto and triton moe_backend.
- 0521_moe_lora_results.md: partial results: all TP=1 eager done; cudagraph n_active=2 succeed; baseline/n_active=0/1 cudagraph OOM during graph capture; TP=2 triton baseline cudagraph: 1012 tok/s.
- test_qwen3_moe_per_expert_lora_password_adapters.py: correctness test for per_expert format (128 experts × 48 layers), max_loras=2 due to buffer size.

Signed-off-by: Qingyang Wu <qingyang@together.ai>

All configs collected after sequential retry with corrected gpu_util (0.75–0.82 for TP=1/2 LoRA to fit model + buffers in 80 GB H100).

Key results:
- TP=2 cudagraph per_expert n=1: 1081 tok/s (+1.25x vs TP=1)
- CUDA graph vs eager LoRA: +11.8x
- per_expert ≈ sglang_shared throughout

Signed-off-by: Qingyang Wu <qingyang@together.ai>

Signed-off-by: Qingyang Wu <qingyang@together.ai>

…format

When w13_A.shape[0] == 1 (sglang_shared gate/up shared outer A), replace:

    selected_A = w13_A[safe_ids]  # materialises (m, k, r, h) gather, 1 GB at rank=256
    lora_a = einsum("mh,mkrh->mkr", ...)

with:

    lora_a = (hidden @ A.T).unsqueeze(1).expand(-1, k, -1).contiguous()

which reads A once and broadcasts over top_k with no data copy.

Same for shared B (down_proj in sglang_shared): (m*k, r) @ B.T instead of gather + einsum.

Eliminates ~2/4 large gather tensors per forward pass, reducing data movement from ~45 GB → ~1 GB per prefill step at rank=256.

Measured improvement: sglang_shared TTFT at rank=256 drops 553 ms → 266 ms (−52%) on Qwen3-30B-A3B TP=2 eager. Correctness verified (8 tests, 30 subtests).

Signed-off-by: Qingyang Wu <qingyang@together.ai>
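
Why the two paths agree when A is shared can be checked on a toy case. The sketch below uses plain Python lists instead of tensors so it runs anywhere; gather_path and shared_path are hypothetical stand-ins for the einsum and matmul forms quoted above, not the code in moe_lora.py.

```python
def matvec(A: list[list[int]], x: list[int]) -> list[int]:
    """A has shape (r, h), x has shape (h,); returns shape (r,)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def gather_path(hidden, w13_A, safe_ids):
    """Per-(token, top_k) gather of A followed by a contraction over h."""
    return [[matvec(w13_A[e], hidden[t]) for e in ids]
            for t, ids in enumerate(safe_ids)]


def shared_path(hidden, A, k):
    """One projection per token, repeated across the k routed experts."""
    out = []
    for x in hidden:
        la = matvec(A, x)
        out.append([list(la) for _ in range(k)])
    return out


m, k = 3, 2
hidden = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 2, 2, 2]]  # (m, h)
A = [[1, 0, 0, 1], [0, 1, 1, 0]]                     # (r, h)
w13_A = [A]                                          # shape[0] == 1: shared
safe_ids = [[0] * k for _ in range(m)]               # every index maps to slot 0

print(gather_path(hidden, w13_A, safe_ids) == shared_path(hidden, A, k))
```

The gather path touches A once per (token, expert) pair while the shared path touches it once per token, which is the source of the data-movement saving the commit reports.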

1. Shared A/B fast path: when w13_A or w13_B shape[0]==1 (sglang_shared format), skip the O(m*k*r*h) gather tensor and use a single matmul. For gate_up: hidden@A.T broadcast over top_k; for down: lora_a@B.T.
2. Gate_up B all-experts GEMM: when A is shared, compute delta for ALL E experts in one GEMM (m,r)@(r,E*n)->(m,E,n), then gather by safe_ids. Replaces 960 tiny per-pair GEMMs with one efficient batched matmul.
3. Remove torch.any(valid) sync: the early-return check caused 96 GPU->CPU synchronisations per prefill, stalling the pipeline on every layer call. Replaced with a lazy validity mask (built only when token_mask is set).
4. Vectorised scatter operations: _add_route_delta and _route_rows_from_cache replaced boolean-index gather with clamp+masked_fill+add_ / scatter_ with dummy row, giving 2-3x speedup (128us->57us, 135us->43us per call).

Result (TP=2 cudagraph, Qwen3-30B-A3B, bs=8, rank=16):
  sglang_shared n=1: TTFT 113ms -> 73ms (-35%), tput 975 -> 1093 tok/s
  per_expert n=1: TTFT 116ms -> 76ms (-34%), tput 1081 -> 1110 tok/s
  n=0 (no active lora): TTFT within 2-4% of baseline (target: <=10%)

Correctness: 8 tests, 30 subtests passed.

Signed-off-by: Qingyang Wu <qingyang@together.ai>

Signed-off-by: Qingyang Wu <qingyang@together.ai>

The bmm reshape introduced a shape error: selected_A.reshape(mk, i, -1).T gave (mk, r, i), which is incompatible with (mk, 1, i) in bmm; it should be (mk, i, r). Reverted to torch.einsum, which is correct. The sync removal and scatter optimizations remain in place.

Signed-off-by: Qingyang Wu <qingyang@together.ai>

Two changes to make MoE LoRA work with CUDA graph decode:

1. cuda_graph_wrapper: force has_active_lora=True and single_lora_slot=0 when capturing the with-LoRA graph (previously captured with has_active_lora=False → LoRA kernels omitted from graph entirely).
2. moe_lora: add dynamic GPU tensor indexing path for CUDA-graph-compatible weight access. MoeLoraContext now holds references to the per-layer buffer tensors (w13_A_buffers etc.). For single-slot batches, weight access uses weight_indices[:1].clamp(0), a GPU tensor read captured in the graph that resolves to the correct adapter slot at replay time. Multi-slot batches fall back to Python dict lookup (eager-only path, unchanged).

Result: sglang_shared n=1 CUDA graph correctly generates 'PHOENIX-441...' (password adapter output) instead of base model output. Throughput: 734 tok/s vs 75 tok/s eager, a ~10× speedup with correct LoRA application.

Signed-off-by: Qingyang Wu <qingyang@together.ai>

…calings

Three micro-optimizations in _apply_gate_up_slot and _apply_down_slot:

1. Defer lora_a = expand+contiguous: only compute the (m,k,r) expanded view when the per-expert B path actually needs it. The all-experts and shared-B paths use lora_a_m (m,r) directly, saving one contiguous-copy kernel per layer.
2. Use repeat_interleave instead of expand+contiguous for shared-B: produces (m*k,r) directly without an intermediate non-contiguous expand.
3. Reuse slot_idx (already computed for w13_A/B gathers) for the scalings lookup, avoiding one extra clamp+gather kernel per layer.

These changes reduce the CUDA kernel count per layer but the measured step time improvement is within noise (~0.03ms); the bottleneck is the ~1100 aggregate LoRA kernels per decode step whose individual execution times sum to ~5ms regardless of small per-kernel reductions.

Signed-off-by: Qingyang Wu <qingyang@together.ai>

Adds two Triton kernels in tokenspeed-kernel for the sglang_shared MoE LoRA format that activate when the sorted (TMA) MoE dispatch is in use:

1. fused_gate_up_expand: replaces the all-experts B GEMM + candidates.gather + _add_route_delta chain (3 kernels) with a single per-sorted-position GEMV that writes directly to the sorted output. Reads only active experts' B weights (~40% of the all-experts GEMM bandwidth). Measured 2.3× faster in microbenchmark (71μs vs 166μs per layer call, rank=16).
2. fused_down_shrink: replaces _route_rows_from_cache + _select_expert_weights + einsum with a per-sorted-position GEMV on the sorted intermediate, then uses scatter_add_ to accumulate into token-ordered down output. Eliminates the 43μs route scatter. Measured 2.4× faster in microbenchmark (61μs vs 146μs per layer call, rank=16).

Both kernels tile the rank dimension in BLOCK_R chunks to bound register pressure for large ranks (r=16 to r=256). Both are CUDA-graph compatible: scaling is loaded from a device tensor so graphs replay with updated adapters.

Integration in moe_lora.py:
- Activated when sorted_token_ids is not None (TMA-enabled MoE config)
- Single-slot CUDA graph path only (multi-slot falls back to existing code)
- Graceful import fallback (_FUSED_MOE_LORA_AVAILABLE) if kernel build fails

All existing accuracy tests pass (test_qwen3_moe_per_expert_lora, 18 subtests).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

moe_dispatch pre-allocates sorted_token_ids for all num_experts (128) potential slots regardless of actual active experts. This produces a tensor larger than intermediate_cache (allocated with a tighter upper bound based on actual token count). Without truncation, _route_rows_from_cache and _add_route_delta failed with shape mismatches, and _apply_down_sorted had an OOB on topk_weights when sorted_token_ids contained the sentinel value route_count (not just -1).

Fixes:
- _route_rows_from_cache: truncate sti to cache.shape[0] before scatter_
- _add_route_delta: truncate sti to output.shape[0] before gather
- _apply_down_sorted: truncate to inter_flat.shape[0], clamp flat_j_safe to [0, route_count-1] (sentinel route_count value → OOB without it)

Also adds the down MoE kernel config file for Qwen3-30B-A3B on H100 with USE_TMA=false for all batch sizes. TMA creates large padded intermediates (up to 16× for BS=8 decode), which degrades decode performance; the fused LoRA kernels activate only with sorted dispatch (USE_TMA=true), which is appropriate for large-M prefill workloads rather than small-batch decode.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
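
The truncate-then-clamp fix can be illustrated on a small padded index list. This is a sketch only: sanitize_sorted_ids is a hypothetical helper, and returning a separate validity mask is an assumption about how padded entries would be ignored downstream.

```python
def sanitize_sorted_ids(sorted_token_ids: list[int], limit: int, route_count: int):
    """Truncate to the consumer's row count, then clamp indices into range."""
    sti = sorted_token_ids[:limit]
    # Padding may be -1 or the sentinel value route_count; both are invalid.
    valid = [0 <= j < route_count for j in sti]
    safe = [min(max(j, 0), route_count - 1) for j in sti]
    return safe, valid


route_count = 6  # number of real (token, top_k) routes
# Over-allocated dispatch output: real ids mixed with -1 and sentinel padding.
sorted_token_ids = [3, 0, 6, 1, -1, 6, 2, 5, 6, 6, 6, 6]

safe, valid = sanitize_sorted_ids(sorted_token_ids, limit=8, route_count=route_count)
print(safe)
print(valid)
```

Clamping only makes the index safe to dereference; the mask is what keeps padded positions from contributing to the result.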

Adds _flat_gate_up_expand_kernel for the decode (flat-pair, non-TMA) path. Unlike the all-experts GEMM which reads all 128 experts' B weights and then gathers, this kernel computes tok = pid_s // K directly inside the kernel, issues one GEMV per (token, topk) pair against only the assigned expert's B rows.

Active-expert B bandwidth: ~51 experts × 1536 × 32 × 2 = 5 MB vs the all-experts GEMM reading 12.6 MB (60% less). Microbenchmark: 20μs vs 68μs per call (3.4×), saving 2.3ms in the isolated gate/up path (×48 layers).

No argsort or sorted_token_ids overhead: it computes the (tok, topk_v) lookup from pid_s inline. Works for any batch size and is CUDA-graph compatible.

Also retains the SCATTER-mode variant of _fused_gate_up_expand_kernel for future TMA sorted-dispatch use.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
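
The figures in this commit message can be re-derived with a short script. The byte factors are copied as quoted (the message does not spell out what each one stands for), and the modulo used to recover the top-k position is an assumption; only tok = pid_s // K is stated in the source.

```python
PER_EXPERT_BYTES = 1536 * 32 * 2  # factors exactly as quoted in the commit message


def b_bytes(num_experts: int) -> int:
    """Bytes of B weights read when touching num_experts experts."""
    return num_experts * PER_EXPERT_BYTES


def flat_pair(pid_s: int, top_k: int) -> tuple[int, int]:
    """Map a flat program id to (token, top-k position); the modulo is assumed."""
    return pid_s // top_k, pid_s % top_k


active, total = b_bytes(51), b_bytes(128)
print(f"active ≈ {active / 1e6:.1f} MB, all experts ≈ {total / 1e6:.1f} MB")
print(f"bandwidth saved ≈ {1 - active / total:.0%}")

# Per-call saving times the layer count, using the quoted microbenchmark numbers.
saved_ms = (68 - 20) * 48 / 1000
print(f"isolated gate/up saving ≈ {saved_ms:.1f} ms")

print(flat_pair(10, top_k=8))
```

Because the (token, top-k) pair is pure integer arithmetic on the program id, no sorted index tensor has to be built before launch.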

… gathers

Profiling showed 4 separate buffer gathers per layer × 48 layers = 3.08ms/step:
  w13_A gather 22µs (copies 131 KB, needed for cuBLAS A GEMM)
  w13_B gather 38µs (copies 12.6 MB, now eliminated by buffer+slot)
  down_A gather 22µs (copies 1.6 MB, now eliminated by buffer+slot)
  down_B gather 22µs (copies 131 KB, now eliminated by flat_down_expand)

After this commit only the w13_A gather / cuBLAS GEMM chain remained. Two more changes eliminate it:

1. flat_a_gemm: new Triton kernel that computes lora_a_m = hidden @ w13_A_buffer[slot, 0, :, :].T directly from the buffer without a prior 131 KB gather copy. Grid (m, R//BLOCK_R). ~5-8µs vs 47µs (22µs gather + 25µs cuBLAS).
2. scalings-buffer: _flat_gate_up_expand_kernel and _flat_down_expand_kernel now receive the full (n_slots,) scalings tensor and load scalings[slot] internally, eliminating the 19µs scalings[slot_idx] gather per layer × 2.

Combined e2e improvement (BS=8, sglang_shared rank=16, H100 TP=2):
  Before: 927 tok/s, 8.63 ms/step, LoRA overhead 2.90 ms
  After: 1107 tok/s, 7.22 ms/step, LoRA overhead 1.50 ms (−48%)

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
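
The end-to-end numbers in this commit are mutually consistent if decode throughput is taken as batch size divided by step time, i.e. one token per sequence per step. That relation is an assumption, not something the commit states, and tokens_per_second is a hypothetical helper.

```python
def tokens_per_second(batch_size: int, step_ms: float) -> float:
    """Decode throughput assuming one new token per sequence per step."""
    return batch_size / (step_ms / 1000.0)


before = tokens_per_second(8, 8.63)
after = tokens_per_second(8, 7.22)
print(f"before ≈ {before:.0f} tok/s, after ≈ {after:.0f} tok/s")

# Relative reduction in the LoRA overhead quoted in the commit message.
overhead_cut = (2.90 - 1.50) / 2.90
print(f"LoRA overhead reduction ≈ {overhead_cut:.0%}")

# The step-time delta should roughly match the overhead delta.
print(f"step delta = {8.63 - 7.22:.2f} ms, overhead delta = {2.90 - 1.50:.2f} ms")
```

The small gap between the computed after-value and the quoted 1107 tok/s is what one would expect from rounding the step time to two decimals.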

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…merge

Main's scheduler changes (PagedCacheGroupFamily rename, KVPrefixCache API changes) are not compatible with the LoRA branch's scheduler extensions. Restore all scheduler csrc and bindings to the branch versions to keep the build working. These scheduler changes can be reconciled separately.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…rsions Signed-off-by: Qingyang Wu <willqywu@gmail.com>

… and attn_output_gate

Three fixes for compatibility with Qwen3.5-35B-A3B LoRA adapters:

1. moe_lora.py (_load_2d_adapter_layer): auto-detect transposed A/B convention. Qwen3.5 per-expert adapters store A as (in_features, rank) and B as (rank, out_features), opposite the standard PEFT (rank, in_features) / (out_features, rank) layout. Detect by shape: if A.shape[0] > A.shape[1] or B.shape[0] < B.shape[1], transpose before loading.
2. lora_manager.py: use model_config.head_dim when available. Qwen3.5-35B-A3B has hidden_size=2048, num_attention_heads=16, but head_dim=256 (decoupled). The old head_dim = hidden // n_heads = 128 was wrong, causing kv_size_per_tp=128 instead of 256.
3. lora_manager.py: account for attn_output_gate in q_size_per_tp. When attn_output_gate=True, the QKV proj uses 2×num_heads Q heads (q_proj output = 8192 for this model), but o_proj still uses 1× (o_proj input = 4096). q_size_per_tp is doubled; o_in_per_tp is not.

Results on Qwen3.5-35B-A3B with Qwen3.5-35B-A3B-LoRA-Password-Adapters (TP=2):
  sglang_shared_outer: 8/8 PASS
  per_expert: 8/8 PASS (required the transpose fix)
  self_attn: 0/8 FAIL (adapter is inherently weak, loss ~1e-4 not 0)

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu and others added 30 commits May 7, 2026 08:14

docs: add LoRA implementation HTML reference

3df2b49

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

docs: add tokenspeed codebase structure HTML reference

879ab71

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Merge remote-tracking branch 'origin/main' into feat/lora-adapter-ser…

a6be351

…ving

Merge branch 'main' into feat/lora-adapter-serving

2c1573c

fix(lora): update import path to match kernel refactor

18bf9dc

Follow-up to the ops/lora/triton/ restructure — update the runtime manager to import from the new location instead of ops/gemm/lora_triton. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu added 30 commits May 19, 2026 18:09

Merge remote-tracking branch 'upstream/main' into pr-83-resolve

03f1759

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

# Conflicts:
#   python/tokenspeed/runtime/models/qwen3.py

Add MoE LoRA buffer tests and docs

bc60b53

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

bench(moe-lora): add TP=2 cudagraph 1000-token results

6fbf50d

Signed-off-by: Qingyang Wu <qingyang@together.ai>

docs: add 0522_results.md with MoE LoRA optimization summary

d4890de

Signed-off-by: Qingyang Wu <qingyang@together.ai>

docs: update 0522_results.md with fused kernel measurements

dc1c945

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Merge branch 'main' into feat/lora-adapter-serving

1179558

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

fix(merge): restore scheduler Python files to branch version

09ca5c1

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

fix(merge): restore event_loop.py and scheduler_utils.py to branch ve…

afd6ec7

…rsions Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu wants to merge 69 commits into lightseekorg:main from qywu:feat/lora-adapter-serving
